For this exploratory data analysis we look at loan listing data from a web service called Prosper to try to figure out who is using the service, why they are taking a loan, and what eventually happens to that loan.
Since the original data set contains over 80 variables, I have picked out a subset for our analysis based on the questions stated above. Some light data wrangling was also done up front, both to make the data set more readable and to handle NA values.
Let’s start by having a look at the summary statistics for the data to see what we have to work with.
## ListingCreationDate Term
## Min. :2005-11-09 20:44:28 Min. :12.00
## 1st Qu.:2008-09-19 10:02:14 1st Qu.:36.00
## Median :2012-06-16 12:37:19 Median :36.00
## Mean :2011-07-09 08:07:23 Mean :40.83
## 3rd Qu.:2013-09-09 19:40:48 3rd Qu.:36.00
## Max. :2014-03-10 12:20:53 Max. :60.00
##
## LoanStatus ClosedDate
## Current :56576 Min. :2005-11-25 00:00:00
## Completed :38074 1st Qu.:2009-07-14 00:00:00
## Chargedoff :11992 Median :2011-04-05 00:00:00
## Defaulted : 5018 Mean :2011-03-07 20:21:21
## Past Due (1-15 days): 806 3rd Qu.:2013-01-30 00:00:00
## (Other) : 1266 Max. :2014-03-10 00:00:00
## NA's : 205 NA's :58848
## BorrowerRate Occupation EmploymentStatus
## Min. :0.0000 Length:113937 Length:113937
## 1st Qu.:0.1340 Class :character Class :character
## Median :0.1840 Mode :character Mode :character
## Mean :0.1928
## 3rd Qu.:0.2500
## Max. :0.4975
##
## EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 19.00 FALSE:56459 FALSE:101218
## Median : 60.00 TRUE :57478 TRUE :12719
## Mean : 89.64
## 3rd Qu.:130.00
## Max. :755.00
##
## DebtToIncomeRatio IncomeRange TotalProsperLoans
## Min. : 0.000 $25,000-49,999:32192 Min. :0.0000
## 1st Qu.: 0.140 $50,000-74,999:31050 1st Qu.:0.0000
## Median : 0.220 $100,000+ :17337 Median :0.0000
## Mean : 0.276 $75,000-99,999:16916 Mean :0.2755
## 3rd Qu.: 0.320 Not displayed : 7741 3rd Qu.:0.0000
## Max. :10.010 $1-24,999 : 7274 Max. :8.0000
## NA's :8554 (Other) : 1427
## LoanOriginalAmount LoanOriginationDate MonthlyLoanPayment
## Min. : 1000 Min. :2005-11-15 00:00:00 Min. : 0.0
## 1st Qu.: 4000 1st Qu.:2008-10-02 00:00:00 1st Qu.: 131.6
## Median : 6500 Median :2012-06-26 00:00:00 Median : 217.7
## Mean : 8337 Mean :2011-07-21 03:18:19 Mean : 272.5
## 3rd Qu.:12000 3rd Qu.:2013-09-18 00:00:00 3rd Qu.: 371.6
## Max. :35000 Max. :2014-03-12 00:00:00 Max. :2251.5
##
## InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
## Min. : 0.00000 Min. : 0.00 Min. : 1.00
## 1st Qu.: 0.00000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 0.00000 Median : 0.00 Median : 44.00
## Mean : 0.02346 Mean : 16.55 Mean : 80.48
## 3rd Qu.: 0.00000 3rd Qu.: 0.00 3rd Qu.: 115.00
## Max. :33.00000 Max. :25000.00 Max. :1189.00
##
## ListingCategory
## Length:113937
## Class :character
## Mode :character
##
##
##
##
Based on the numerical data above, our typical loan taker is a first-time Prosper user who is about equally likely to be a homeowner or not (57,478 homeowners versus 56,459 non-homeowners), taking a 36-month loan (the median term; the mean is about 41 months) at an interest rate of around 19%. The typical (median) loan size is $6,500.
Next we plot the number of occurrences of each category for the nominal variables in our data set.
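As a minimal sketch of how such counts can be produced, the snippet below tallies one nominal variable and ranks its categories. The toy `loans` data frame is hypothetical and only stands in for the Prosper listings; the report's own plotting code is not shown here, so this is an illustration rather than the method actually used.

```r
# Hypothetical toy data standing in for the Prosper listings.
loans <- data.frame(
  EmploymentStatus = c("Employed", "Full-time", "Employed",
                       "Self-employed", "Employed", "Full-time"),
  stringsAsFactors = FALSE
)

# Count occurrences per category and sort from most to least common.
status_counts <- sort(table(loans$EmploymentStatus), decreasing = TRUE)
print(status_counts)
```

The sorted table can be passed straight to `barplot()` to get a ranked bar chart of the categories.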
From the above plots we can see that most of the loan takers are employed, but the type of occupation is seldom given: the vague “Professional” and “Other” occupation types are both in the top ten. Roughly half of the loans are still being repaid, but there is also a substantial number of past due, defaulted, or charged-off loans.
Further, the income range appears roughly normally distributed, centered somewhere around $50,000. Lastly, we have the listing categories for the loan listings, and we can see that roughly half of the reasons given for loans through Prosper are debt consolidation, followed by the rather vague “Not Available” and “Other” categories in the top three.
Now, having taken an initial look at all the variables in the data set, let’s revisit and plot some of the more interesting numerical variables to see how they are distributed over time.
By plotting the above values we discovered some interesting facts, such as that the loan terms are probably fixed at one, three, or five years. We also saw that the usual amounts being borrowed cluster around even multiples of $5,000, with a maximum of $35,000. Lastly, the loan origination dates clearly show effects from the 2008 recession and also a peak in new loans in later years, which we still aren’t able to explain. Let’s proceed with some further analysis before investigating inter-variable relations in the bivariate section.
The original data set contains more variables than we could practically cover in one go, so to narrow them down we posed some questions about the data. Let’s revisit these questions to see if we have already made any discoveries worth noting.
First, to see who is using the service we can look at the following variables mentioned:
Based on the summary data and plots presented, our typical loan taker is employed, with an unspecified occupation, and probably has an income of $25,000 to $50,000. He/she is currently a homeowner and has a debt-to-income ratio of about 0.22.
The reason for the loan is most likely debt consolidation, with home improvement and business as distant runners-up among the specified reasons, as seen in the histogram of ranked listing categories.
To see what eventually happened to the loans we can look at the loan status bar plot, which gives an overview of the different statuses for all the loans in the data set. Out of roughly 114,000 loan listings, about 17,000 have been defaulted (5,018) or charged off (11,992; > 150 days overdue with no reasonable expectation of sufficient payment).
The main features of interest are the loan origination dates versus closed dates, together with the data on overdue, defaulted, or charged-off loans. Using these features together with the calculated active loans variable, it would be interesting to dive deeper and see how, and when, the 2008 recession affected the loans taken. Further, since the service seems to have enjoyed some explosive growth going into 2014, it would be very interesting to see what these added loans are and perhaps why they have increased.
Another feature worth looking into is the borrower rate, which mostly piques my interest due to the unclear shape of its distribution. Investigating which variables are correlated with it, and how they affect this blob of values centered somewhere around 0.2, would be very interesting and perhaps a good candidate for a regression model and analysis.
Using the loan origination dates together with the closing dates, I calculated a new variable called active listings to show the volume of current loans on the service. The calculation was made by taking the difference between originated and closed loans for each date during the period between 2005 and 2014 and then calculating the cumulative sum of those differences.
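The calculation described above could look roughly like the following sketch. The toy `loans` data frame and the helper `count_per_day()` are hypothetical; only the column names `LoanOriginationDate` and `ClosedDate` come from the data set, and an NA closing date is assumed to mean the loan is still open.

```r
# Hypothetical toy data: origination and closing dates for five loans.
# An NA ClosedDate is assumed to mean the loan is still open.
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2008-01-01", "2008-01-01",
                                  "2008-01-02", "2008-01-03",
                                  "2008-01-03")),
  ClosedDate = as.Date(c("2008-01-02", NA, "2008-01-03", NA, NA))
)

# Every calendar day covered by either kind of date.
all_dates <- c(loans$LoanOriginationDate, loans$ClosedDate)
days <- seq(min(all_dates, na.rm = TRUE),
            max(all_dates, na.rm = TRUE), by = "day")

# Count events per day; days without events get 0, NAs are ignored.
count_per_day <- function(dates, days) {
  day_factor <- factor(as.character(dates), levels = as.character(days))
  as.integer(table(day_factor))
}
originated <- count_per_day(loans$LoanOriginationDate, days)
closed <- count_per_day(loans$ClosedDate, days)

# Active listings = running total of (originated - closed) per day.
active <- data.frame(
  Date = days,
  ActiveListings = cumsum(originated - closed)
)
print(active)
```

In this toy example five loans are opened and two are closed over three days, so the running total ends at three active listings.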
In addition to making sure that all variables were read in using the correct data type, I also chose how to handle NA values. NA values in nominal data were replaced with the pre-existing category best suited to keep the subsequent histograms readable. For ordinal variables where NAs were present, the numerical values were set to 0 in order to ease calculations in the analysis while not influencing other statistical measures.
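A minimal sketch of this kind of NA handling is shown below. The toy data, the choice of “Other” as the substitute category, and the choice of `TotalProsperLoans` as the variable set to 0 are all assumptions made for illustration; the report does not state which categories or variables were actually used.

```r
# Hypothetical toy data with missing values in both kinds of column.
loans <- data.frame(
  Occupation = c("Teacher", NA, "Engineer", NA),
  TotalProsperLoans = c(1, NA, NA, 3),
  stringsAsFactors = FALSE
)

# Nominal variable: fold NAs into an existing catch-all category.
# "Other" is an assumed choice of substitute category.
loans$Occupation[is.na(loans$Occupation)] <- "Other"

# Count-like variable: treat a missing value as zero (assumed here to
# mean "no previous Prosper loans").
loans$TotalProsperLoans[is.na(loans$TotalProsperLoans)] <- 0

print(loans)
```

Note that replacing NAs with 0 does shift means and quantiles for that column, so this is only reasonable where a missing value genuinely means “none”.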
To follow up on our findings regarding the effects of the 2008 recession on active loan volumes, let’s plot the change in loan volumes for the entire period. We will use the monthly deltas to make the graph easier to read.
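One way to compute such monthly deltas, sketched under the assumption that a delta is the number of loans originated minus the number closed in each calendar month, is shown below. The toy `loans` data frame is hypothetical, and the actual aggregation used for the plot may differ.

```r
# Hypothetical toy data; an NA ClosedDate means the loan is still open.
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2008-01-10", "2008-01-20",
                                  "2008-02-05", "2008-03-15")),
  ClosedDate = as.Date(c("2008-02-01", "2008-03-01", "2008-03-20", NA))
)

# Label each date with its year and month, e.g. "2008-01".
opened_month <- format(loans$LoanOriginationDate, "%Y-%m")
closed_month <- format(loans$ClosedDate, "%Y-%m")

# All months with at least one event (sort() drops the NA label).
months <- sort(unique(c(opened_month, closed_month)))

# Events per month, with 0 for months that lack that kind of event.
opened <- as.integer(table(factor(opened_month, levels = months)))
closed <- as.integer(table(factor(closed_month, levels = months)))

# Monthly delta = loans originated minus loans closed in that month.
monthly <- data.frame(Month = months, Delta = opened - closed)
print(monthly)
```

Months with no originations or closings at all do not appear in this sketch; for a continuous time axis they would need to be added with a delta of 0.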
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.